library("tidyverse")
library("tidymodels")
library("tidylog")

# Reading in the Walmart dataset:

dfw <- read_csv("data/walmart.csv") %>% 
  rename_with(tolower) %>% 
  arrange(store)


head(dfw)
str(dfw)
spc_tbl_ [6,435 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ store       : num [1:6435] 1 1 1 1 1 1 1 1 1 1 ...
 $ date        : Date[1:6435], format: "2010-04-16" "2012-04-06" ...
 $ isholiday   : logi [1:6435] FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ temperature : num [1:6435] 66.3 70.4 87.2 42.3 84.8 ...
 $ fuel_price  : num [1:6435] 2.81 3.89 2.63 2.57 3.57 ...
 $ cpi         : num [1:6435] 210 221 212 211 222 ...
 $ unemployment: num [1:6435] 7.81 7.14 7.79 8.11 6.91 ...
 $ size        : num [1:6435] 151315 151315 151315 151315 151315 ...
 $ weekly_sales: num [1:6435] 1105515 1505325 837329 1112467 1085133 ...
 - attr(*, "spec")=
  .. cols(
  ..   Store = col_double(),
  ..   Date = col_date(format = ""),
  ..   IsHoliday = col_logical(),
  ..   Temperature = col_double(),
  ..   Fuel_Price = col_double(),
  ..   CPI = col_double(),
  ..   Unemployment = col_double(),
  ..   Size = col_double(),
  ..   Weekly_Sales = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 

# The str() output above shows the structure of the data and each column's data type. Note that isholiday is a logical (i.e., TRUE/FALSE) predictor variable.

Simple Linear Regression Model

We will begin by running a simple linear model that regresses weekly sales onto Consumer Price Index (CPI)

# Specifying our model type and setting the computational engine

linear_model <- 
  linear_reg() %>% 
  set_engine("lm")

# Fitting the model

fit_cpi <- 
  linear_model %>% 
  fit(weekly_sales ~ cpi, data = dfw)
  
# Model output 

summary(fit_cpi$fit)

Call:
stats::lm(formula = weekly_sales ~ cpi, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-662386 -318443  -73868  258442 2095880 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 827280.5    21778.4  37.986  < 2e-16 ***
cpi           -732.7      123.7  -5.923 3.33e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 390600 on 6433 degrees of freedom
Multiple R-squared:  0.005423,  Adjusted R-squared:  0.005269 
F-statistic: 35.08 on 1 and 6433 DF,  p-value: 3.332e-09
In this model, the intercept implies expected weekly sales of ~$827,280 for a store facing a hypothetical CPI of 0. That value lies well outside the CPI values seen in the data, so the intercept has little practical meaning on its own. We also observe that the relationship between Weekly_Sales and CPI is negative: a one-unit increase in CPI is associated with a decrease of ~$733 in weekly sales, and a one-unit decrease with an increase of ~$733.
In evaluating the model statistics, we see an Adjusted R-squared value of 0.005269. In other words, this model explains only about 0.5% of the variance in Walmart's weekly sales. So, while the CPI coefficient is statistically significant and our interpretation of its effect is still valid, we must conclude that CPI alone fails to explain the variance in our target variable.
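As a quick sanity check, both R-squared values can be recovered from the F-statistic and its degrees of freedom. The sketch below uses only numbers printed in the summary output; the variable names are illustrative and not part of the original analysis.

```r
# Values copied from the summary(fit_cpi$fit) output
f_stat <- 35.08   # F-statistic
df_num <- 1       # numerator degrees of freedom (one predictor)
df_den <- 6433    # residual degrees of freedom

# R-squared expressed in terms of the F-statistic
r_squared <- (f_stat * df_num) / (f_stat * df_num + df_den)

# Adjusted R-squared penalizes for the number of predictors
n <- df_den + df_num + 1  # number of observations
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / df_den

round(c(r_squared = r_squared, adj_r_squared = adj_r_squared), 6)
```

Both values agree with the reported 0.005423 and 0.005269 up to rounding of the printed F-statistic.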
# We'll plot the effect of CPI on sales for a few different stores in the dataset, starting with store 10.

plot_store_10 <- 
  dfw %>% 
  filter(store == 10) %>% 
  ggplot(aes(x = cpi, y = weekly_sales)) +
    geom_point() +
    geom_smooth(method = "lm") +
    labs(title = 'Weekly Sales vs. CPI for Store 10', x = 'Consumer Price Index', y = 'Weekly Sales (USD)') +
  theme_minimal()
filter: removed 6,292 rows (98%), 143 rows remaining
plotly::ggplotly(plot_store_10)
`geom_smooth()` using formula = 'y ~ x'
plot_store_11 <- 
  dfw %>% 
  filter(store == 11) %>% 
  ggplot(aes(x = cpi, y = weekly_sales)) +
    geom_point() +
    geom_smooth(method = "lm") +
    labs(title = 'Weekly Sales vs. CPI for Store 11', x = 'Consumer Price Index', y = 'Weekly Sales (USD)') +
  theme_minimal()
filter: removed 6,292 rows (98%), 143 rows remaining
plotly::ggplotly(plot_store_11)
`geom_smooth()` using formula = 'y ~ x'
plot_store_12 <- 
  dfw %>% 
  filter(store == 12) %>% 
  ggplot(aes(x = cpi, y = weekly_sales)) +
    geom_point() +
    geom_smooth(method = "lm") +
    labs(title = 'Weekly Sales vs. CPI for Store 12', x = 'Consumer Price Index', y = 'Weekly Sales (USD)') +
  theme_minimal()
filter: removed 6,292 rows (98%), 143 rows remaining
plotly::ggplotly(plot_store_12)
`geom_smooth()` using formula = 'y ~ x'
plot_store_13 <- 
  dfw %>% 
  filter(store == 13) %>% 
  ggplot(aes(x = cpi, y = weekly_sales)) +
    geom_point() +
    geom_smooth(method = "lm") +
    labs(title = 'Weekly Sales vs. CPI for Store 13', x = 'Consumer Price Index', y = 'Weekly Sales (USD)') +
  theme_minimal()
filter: removed 6,292 rows (98%), 143 rows remaining
plotly::ggplotly(plot_store_13)
`geom_smooth()` using formula = 'y ~ x'

# A plot to demonstrate the fluctuation of CPI by region/store.  Note that the
# smoothed line is negative in some locales and positive in others.

animated_plot <- 
    dfw %>% 
    filter(store %in% c(11:15)) %>% 
    ggplot(aes(x = cpi, y = weekly_sales)) + 
    geom_point() + 
    geom_smooth(method = lm) +
    labs(title = 'Weekly Sales vs. CPI for Store {closest_state}', x = 'Consumer Price Index', y = 'Weekly Sales (USD)') +
    theme_minimal() + 
    gganimate::transition_states(store, transition_length = 1, state_length = 2) +
    gganimate::view_follow()

animated_plot

What we observe here is that the direction and size of the CPI effect can vary greatly by store/region. This is consistent with our evaluation of fit_cpi: that pooled model explained only a small share (~0.5%) of the variance in Weekly_Sales, so a single CPI slope describes individual stores poorly and we would expect to see these kinds of swings. With a (much) higher Adjusted R-squared, these variations would look unusual.

dfw %>%
  group_by(store) %>%
  group_modify(~ tidy(lm(weekly_sales ~ ., data = .x))) %>%
  filter(term == "cpi")
# Filtering for 2012 and plotting CPI against Weekly_Sales

plot <- dfw %>%
  filter(lubridate::year(date) == 2012) %>%
  ggplot(aes(x = cpi, y = weekly_sales)) +
    geom_point() +
    geom_smooth(method = lm) +
    labs(title = 'Weekly Sales vs. CPI in 2012', x = 'Consumer Price Index', y = 'Weekly Sales (USD)') +
  theme_minimal()
    
plotly::ggplotly(plot)
We see an interesting effect when we filter for one specific year. The clusters are nearly vertical because CPI is calculated geographically, by either Core Based Statistical Area (CBSA) or Metropolitan Statistical Area (MSA). Stores in the same region share nearly the same CPI but have different sales volumes, hence the vertical clusters.
plot_store_cpi <- dfw %>%
  filter(store == 10, lubridate::year(date) == 2012) %>%
  ggplot(aes(x = cpi, y = weekly_sales)) +
    geom_point() +
    geom_smooth(method = lm) +
    labs(title = 'Weekly Sales vs. CPI for Store 10 in 2012', x = 'Consumer Price Index', y = 'Weekly Sales (USD)') +
  theme_minimal()
plotly::ggplotly(plot_store_cpi)
Although CPI varies by region, the deviation in CPI across time for a single region tends to be much lower, which is why we see such a slim range here. Since CPI is a measure of inflation, we expect to see these regional effects.

# A new iteration of the previous model that also includes store Size as an 
# independent variable

options(scipen = 999)

fit_cpi_size <- 
  linear_model %>% 
  fit(weekly_sales ~ cpi + size, data = dfw)

summary(fit_cpi_size$fit)

Call:
stats::lm(formula = weekly_sales ~ cpi + size, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-563750 -167145  -29612  112172 1912650 

Coefficients:
                Estimate   Std. Error t value            Pr(>|t|)
(Intercept) 182831.50332  14966.91316  12.216 <0.0000000000000002
cpi           -657.04633     76.92121  -8.542 <0.0000000000000002
size             4.84669      0.04796 101.048 <0.0000000000000002
               
(Intercept) ***
cpi         ***
size        ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 242800 on 6432 degrees of freedom
Multiple R-squared:  0.6156,    Adjusted R-squared:  0.6155 
F-statistic:  5151 on 2 and 6432 DF,  p-value: < 0.00000000000000022
# Comparing fit_cpi to fit_cpi_size to see which is better at explaining the variance
# in Weekly_Sales

anova(fit_cpi$fit, fit_cpi_size$fit)
Analysis of Variance Table

Model 1: Weekly_Sales ~ CPI
Model 2: Weekly_Sales ~ CPI + Size
  Res.Df        RSS Df  Sum of Sq     F    Pr(>F)    
1   6433 9.8128e+14                                  
2   6432 3.7924e+14  1 6.0204e+14 10211 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The model that includes size as a predictor variable (fit_cpi_size) performs significantly better than fit_cpi. Its Adjusted R-squared shows that the model now explains ~62% of the variance in weekly sales, and the ANOVA test confirms that the improvement from including size is statistically significant.
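The F-statistic in the ANOVA table can be reproduced by hand from the two residual sums of squares. The following is a minimal sketch using the rounded values printed in the table; the variable names are illustrative.

```r
# Values copied from the anova() table
rss_small <- 9.8128e14  # residual sum of squares, model with cpi only
rss_large <- 3.7924e14  # residual sum of squares, model with cpi + size
df_small  <- 6433       # residual degrees of freedom, smaller model
df_large  <- 6432       # residual degrees of freedom, larger model

# F = (drop in RSS per added parameter) / (residual variance of the larger model)
f_stat <- ((rss_small - rss_large) / (df_small - df_large)) /
  (rss_large / df_large)

# Upper-tail probability under the F distribution
p_value <- pf(f_stat, df_small - df_large, df_large, lower.tail = FALSE)

f_stat
```

The result matches the reported F of 10211 up to rounding of the printed sums of squares.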
tidy(fit_cpi$fit)
tidy(fit_cpi_size$fit)
Note also that the magnitude of the CPI coefficient in the revised model has shrunk from ~$733 to ~$657. This is because size now accounts for part of the variation in sales that the CPI-only model had attributed to CPI.
# Building a model that uses all variables EXCEPT Date and Store

fit_full <- 
  linear_model %>% 
  fit(weekly_sales ~ . - store - date, data = dfw)

summary(fit_full$fit)

Call:
stats::lm(formula = weekly_sales ~ . - store - date, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-557148 -165608  -24125  112851 1918479 

Coefficients:
                  Estimate   Std. Error t value
(Intercept)   313268.64179  35462.52571   8.834
isholidayTRUE  60120.82723  11961.41705   5.026
temperature     1001.63556    173.85005   5.761
fuel_price    -13332.38798   6821.86906  -1.954
cpi             -946.07156     84.45115 -11.203
unemployment  -12517.11110   1724.67790  -7.258
size               4.83971      0.04802 100.786
                          Pr(>|t|)    
(Intercept)   < 0.0000000000000002 ***
isholidayTRUE     0.00000051374714 ***
temperature       0.00000000872334 ***
fuel_price                  0.0507 .  
cpi           < 0.0000000000000002 ***
unemployment      0.00000000000044 ***
size          < 0.0000000000000002 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 241200 on 6428 degrees of freedom
Multiple R-squared:  0.621, Adjusted R-squared:  0.6206 
F-statistic:  1755 on 6 and 6428 DF,  p-value: < 0.00000000000000022
anova(fit_cpi_size$fit, fit_full$fit)
Analysis of Variance Table

Model 1: Weekly_Sales ~ CPI + Size
Model 2: Weekly_Sales ~ (Store + Date + IsHoliday + Temperature + Fuel_Price + 
    CPI + Unemployment + Size) - Store - Date
  Res.Df        RSS Df  Sum of Sq      F    Pr(>F)    
1   6432 3.7924e+14                                   
2   6428 3.7394e+14  4 5.3028e+12 22.789 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
We observe a further, though slight, improvement in the Adjusted R-squared value (from 0.6155 to 0.6206) in the new model, which uses every predictor except store and date (fit_full). The ANOVA test also confirms that the improvement in explanatory power is statistically significant.

More Linear Regression

We hypothesize that the effect of good weather is increased on holidays. We can test this by revising fit_full and including an interaction term.
fit_full_int <- 
  linear_model %>% 
  fit(weekly_sales ~ . - store - date + isholiday * temperature, data = dfw)

summary(fit_full_int$fit)

Call:
stats::lm(formula = weekly_sales ~ . - store - date + isholiday * 
    temperature, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-557499 -165415  -24493  112914 1918376 

Coefficients:
                              Estimate   Std. Error t value
(Intercept)               314781.71540  35650.02554   8.830
isholidayTRUE              47453.37499  32654.55081   1.453
temperature                  980.88764    180.84372   5.424
fuel_price                -13421.90341   6825.68548  -1.966
cpi                         -945.95168     84.45707 -11.200
unemployment              -12511.43706   1724.84245  -7.254
size                           4.83969      0.04802 100.779
isholidayTRUE:temperature    247.28642    593.15059   0.417
                                      Pr(>|t|)    
(Intercept)               < 0.0000000000000002 ***
isholidayTRUE                           0.1462    
temperature                  0.000000060421363 ***
fuel_price                              0.0493 *  
cpi                       < 0.0000000000000002 ***
unemployment                 0.000000000000453 ***
size                      < 0.0000000000000002 ***
isholidayTRUE:temperature               0.6768    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 241200 on 6427 degrees of freedom
Multiple R-squared:  0.621, Adjusted R-squared:  0.6206 
F-statistic:  1504 on 7 and 6427 DF,  p-value: < 0.00000000000000022
anova(fit_full$fit, fit_full_int$fit)
Analysis of Variance Table

Model 1: weekly_sales ~ (store + date + isholiday + temperature + fuel_price + 
    cpi + unemployment + size) - store - date
Model 2: weekly_sales ~ (store + date + isholiday + temperature + fuel_price + 
    cpi + unemployment + size) - store - date + isholiday * temperature
  Res.Df             RSS Df   Sum of Sq      F Pr(>F)
1   6428 373938635272534                             
2   6427 373928522950003  1 10112322531 0.1738 0.6768
Although the interaction coefficient in fit_full_int is positive (~$247 per degree on holiday weeks), which is the direction our hypothesis predicts, it is not statistically significant (p ≈ 0.68). The ANOVA test likewise shows no statistically significant improvement, so we cannot assert that the model with the interaction term is an improvement.
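To make the interaction term concrete: in R's formula syntax, `a * b` expands to both main effects plus their interaction, and the holiday-week temperature slope is the sum of the main effect and the interaction coefficient. The sketch below only inspects the formula and reuses the printed estimates; the variable names are illustrative.

```r
# a * b is shorthand for a + b + a:b
f <- weekly_sales ~ isholiday * temperature
term_labels <- attr(terms(f), "term.labels")
term_labels

# Temperature slopes implied by the fit_full_int estimates
slope_non_holiday <- 980.88764                # temperature main effect
slope_holiday     <- 980.88764 + 247.28642    # main effect + interaction
c(non_holiday = slope_non_holiday, holiday = slope_holiday)
```

Because the interaction estimate has a standard error of about 593, the difference between the two slopes cannot be distinguished from zero.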
We’ll also test whether the effect of temperature on weekly sales is linear by adding a squared temperature term to the model.
fit_full_sq <- 
  linear_model %>% 
  fit(weekly_sales ~ . - store - date + I(temperature ^2), data = dfw)

summary(fit_full_sq$fit)

Call:
stats::lm(formula = weekly_sales ~ . - store - date + I(temperature^2), 
    data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-561455 -165260  -24674  112058 1911166 

Coefficients:
                     Estimate   Std. Error t value
(Intercept)      261043.07789  41108.23517   6.350
isholidayTRUE     62296.69695  11987.90765   5.197
temperature        3293.89228    930.05089   3.542
fuel_price       -14713.82683   6841.25650  -2.151
cpi                -954.71915     84.48673 -11.300
unemployment     -12529.36093   1723.97501  -7.268
size                  4.83146      0.04811 100.420
I(temperature^2)    -19.82165      7.90072  -2.509
                             Pr(>|t|)    
(Intercept)         0.000000000229814 ***
isholidayTRUE       0.000000209190349 ***
temperature                    0.0004 ***
fuel_price                     0.0315 *  
cpi              < 0.0000000000000002 ***
unemployment        0.000000000000409 ***
size             < 0.0000000000000002 ***
I(temperature^2)               0.0121 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 241100 on 6427 degrees of freedom
Multiple R-squared:  0.6214,    Adjusted R-squared:  0.621 
F-statistic:  1507 on 7 and 6427 DF,  p-value: < 0.00000000000000022
anova(fit_full$fit, fit_full_sq$fit)
Analysis of Variance Table

Model 1: weekly_sales ~ (store + date + isholiday + temperature + fuel_price + 
    cpi + unemployment + size) - store - date
Model 2: weekly_sales ~ (store + date + isholiday + temperature + fuel_price + 
    cpi + unemployment + size) - store - date + I(temperature^2)
  Res.Df             RSS Df    Sum of Sq      F  Pr(>F)  
1   6428 373938635272534                                 
2   6427 373572776702821  1 365858569713 6.2943 0.01214 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
## Plotting the relationship between Temperature^2 and Weekly Sales 

dfw %>%
  ggplot(aes(x = temperature, y = weekly_sales)) + 
  geom_smooth(method = "lm", formula = y ~ x + I(x^2))

The model output demonstrates a curvilinear, inverted U-shaped relationship, which the plotted quadratic fit also shows. People are less likely to shop retail on a freezing cold day. Increasing temperatures are associated with increased sales, but only to a point. As temperatures become excessive and dangerous, sales start to decrease.
If we were managing Walmart’s promotions, we could offer larger discounts when the weather is at either extreme and perhaps even increase the price of certain products when the temperature is mild.
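The temperature at which predicted sales peak follows directly from the two temperature coefficients of fit_full_sq: a quadratic b1*x + b2*x^2 with b2 < 0 reaches its maximum at x = -b1 / (2*b2). A small sketch using the printed estimates (variable names are illustrative; the result is in the dataset's temperature units, which the values suggest are degrees Fahrenheit):

```r
# Coefficients copied from summary(fit_full_sq$fit)
b_temp    <- 3293.89228   # temperature
b_temp_sq <- -19.82165    # I(temperature^2)

# Vertex of the parabola: where the marginal effect of temperature is zero
turning_point <- -b_temp / (2 * b_temp_sq)
turning_point
```

This gives a turning point of roughly 83, holding the other predictors fixed.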

Predictive Analytics

Now that we have a model that is fairly robust we will use it to make predictions of weekly sales revenue.
# Setting seed for reproducibility
# (set.seed() interprets its argument as an integer, so this behaves like set.seed(3))

set.seed(3.14159)

# Splitting the data set into a training dataset (75%) and a test dataset (25%)

dfw_split <- initial_split(dfw)

dfw_train <-  training(dfw_split)

dfw_test <- testing(dfw_split)
# Fitting the model

fit_org <- 
  linear_model %>% 
  fit(weekly_sales ~ . - date - store + I(temperature^2), data = dfw_train)

summary(fit_org$fit)

Call:
stats::lm(formula = weekly_sales ~ . - date - store + I(temperature^2), 
    data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-557260 -165114  -25112  115048 1913671 

Coefficients:
                     Estimate   Std. Error t value
(Intercept)      254646.56714  47252.27399   5.389
isholidayTRUE     60380.17646  13967.26022   4.323
temperature        3056.33287   1068.34983   2.861
fuel_price       -19393.30976   7819.05402  -2.480
cpi                -921.70456     96.40208  -9.561
unemployment     -10579.94848   1991.83421  -5.312
size                  4.82603      0.05496  87.809
I(temperature^2)    -16.27913      9.05811  -1.797
                             Pr(>|t|)    
(Intercept)              0.0000000742 ***
isholidayTRUE            0.0000157042 ***
temperature                   0.00424 ** 
fuel_price                    0.01316 *  
cpi              < 0.0000000000000002 ***
unemployment             0.0000001135 ***
size             < 0.0000000000000002 ***
I(temperature^2)              0.07237 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 239000 on 4818 degrees of freedom
Multiple R-squared:  0.6248,    Adjusted R-squared:  0.6242 
F-statistic:  1146 on 7 and 4818 DF,  p-value: < 0.00000000000000022
  
# The linear regression output as a tibble

tidy(fit_org)

# Creating a new dataframe with predicted values

results_org <-
  predict(fit_org, new_data = dfw_test) %>% 
  bind_cols(dfw_test) %>% 
  rename(Predicted_Sales = .pred)

results_org %>% 
  arrange(date)
# Defining the metric set we will be working with to evaluate the models

perf_metrics <- metric_set(rmse, mae)

# Calculating the performance of fit_org on the test set

perf_metrics(results_org, truth =  weekly_sales, estimate = Predicted_Sales)
These metrics indicate that our model's predictions are off by ~$240,424 according to RMSE and by ~$179,092 according to MAE. These numbers appear alarming until one recalls that weekly sales by Walmart store location range from about $70k to $2.8 million, with a mean of $740k and a median of $689k.
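For readers unfamiliar with the two metrics, the sketch below computes them by hand in base R. The `truth` and `predicted` vectors are made-up toy values, not taken from the Walmart data.

```r
# Hypothetical toy values for illustration only
truth     <- c(100, 200, 300)
predicted <- c(110, 190, 330)

errors <- truth - predicted

# RMSE squares the errors before averaging, so large misses weigh more
rmse_manual <- sqrt(mean(errors^2))

# MAE averages the absolute errors, treating all misses proportionally
mae_manual <- mean(abs(errors))

c(rmse = rmse_manual, mae = mae_manual)
```

RMSE is never smaller than MAE, and a wide gap between them (as in our results) suggests that a subset of predictions misses by a large amount.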
#Building the model without I(Temperature^2) variable using only the training data set

fit_org_nosq <-
  linear_model %>% 
  fit(weekly_sales ~ . - store - date, data = dfw_train)
summary(fit_org_nosq$fit)

Call:
stats::lm(formula = weekly_sales ~ . - store - date, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-553813 -165778  -24194  114613 1919644 

Coefficients:
                  Estimate   Std. Error t value
(Intercept)   298136.37508  40595.03611   7.344
isholidayTRUE  58457.32991  13929.44258   4.197
temperature     1170.15986    199.77660   5.857
fuel_price    -18416.04919   7801.92738  -2.360
cpi             -913.65905     96.32035  -9.486
unemployment  -10562.20302   1992.27052  -5.302
size               4.83249      0.05486  88.094
                          Pr(>|t|)    
(Intercept)      0.000000000000242 ***
isholidayTRUE    0.000027573928584 ***
temperature      0.000000005015608 ***
fuel_price                  0.0183 *  
cpi           < 0.0000000000000002 ***
unemployment     0.000000119924903 ***
size          < 0.0000000000000002 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 239000 on 4819 degrees of freedom
Multiple R-squared:  0.6245,    Adjusted R-squared:  0.624 
F-statistic:  1336 on 6 and 4819 DF,  p-value: < 0.00000000000000022
#Comparing the models

anova(fit_org$fit, fit_org_nosq$fit)
Analysis of Variance Table

Model 1: weekly_sales ~ (store + date + isholiday + temperature + fuel_price + 
    cpi + unemployment + size) - date - store + I(temperature^2)
Model 2: weekly_sales ~ (store + date + isholiday + temperature + fuel_price + 
    cpi + unemployment + size) - store - date
  Res.Df             RSS Df     Sum of Sq      F  Pr(>F)  
1   4818 275144420434253                                  
2   4819 275328871300499 -1 -184450866246 3.2299 0.07237 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Creating a new dataframe results_org_nosq with predictions

results_org_nosq <-
  predict(fit_org_nosq, new_data = dfw_test) %>% 
  bind_cols(dfw_test) %>% 
  rename(Predicted_Sales = .pred)

#Calculating performance metrics

perf_metrics(results_org_nosq, truth =  weekly_sales, estimate = Predicted_Sales)
When we remove the squared temperature term, something notable occurs. First, our Adjusted R-squared value diminishes slightly (from 0.6242 to 0.6240), making it a marginally less appealing model in terms of explaining the variance in weekly sales, and the ANOVA comparison on the training data (p ≈ 0.07) shows that the squared term is no longer significant at the 5% level. However, we also observe that the test-set error has been reduced, making fit_org_nosq relatively superior in terms of predictive capability. Since we are trying to build a reliable predictive model, we exclude the term and conclude that fit_org_nosq is better for that purpose.

More Predictive Modeling

We are fairly pleased with both the explanatory and predictive power of fit_org_nosq, but of course we would like to improve upon both. One issue that we have not yet discussed is the variability in weekly sales across Walmart locations, as shown below.
# Calculating total weekly sales per store

sales_by_store <- aggregate(weekly_sales ~ store, data = dfw, sum)

# A bar chart showing the distribution of weekly sales revenue by store

ggplot(sales_by_store, aes(x = store, y = weekly_sales)) +
  geom_bar(stat = "identity", fill = "#0078D4") +
  labs(title = "Total Weekly Sales by Store", x = "Store Number", y = "Total Weekly Sales")

Viewing the bar chart, we can see that sales differ widely across stores, so rescaling the weekly_sales variable could improve our model. One way to do this is to replace each weekly sales value with its natural logarithm, which compresses the gap between high-volume and low-volume stores. We’ll use the log() function to accomplish this.
# Log model with all other variables unchanged
dfw_log <-
  dfw %>% 
  mutate(log_sales = log(weekly_sales))

dfw_log
set.seed(3.14159)

dfwlog_split <- initial_split(dfw_log)
dfwlog_train <- training(dfwlog_split)
dfwlog_test <- testing(dfwlog_split)
fit_log <- 
  linear_model %>% 
  fit(log_sales ~ . - store - date - weekly_sales, data=dfwlog_train)

summary(fit_log$fit)

Call:
stats::lm(formula = log_sales ~ . - store - date - weekly_sales, 
    data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.27631 -0.22829 -0.01924  0.22924  1.48007 

Coefficients:
                    Estimate     Std. Error t value
(Intercept)   12.46613737452  0.05575357064 223.594
isholidayTRUE  0.06377635244  0.01913081587   3.334
temperature    0.00047638876  0.00027437490   1.736
fuel_price    -0.00700762675  0.01071523395  -0.654
cpi           -0.00118499668  0.00013228720  -8.958
unemployment  -0.00484947664  0.00273620141  -1.772
size           0.00000809484  0.00000007534 107.444
                          Pr(>|t|)    
(Intercept)   < 0.0000000000000002 ***
isholidayTRUE             0.000863 ***
temperature               0.082580 .  
fuel_price                0.513151    
cpi           < 0.0000000000000002 ***
unemployment              0.076401 .  
size          < 0.0000000000000002 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3283 on 4819 degrees of freedom
Multiple R-squared:  0.7109,    Adjusted R-squared:  0.7106 
F-statistic:  1975 on 6 and 4819 DF,  p-value: < 0.00000000000000022
This final model uses a log-linear regression to explain weekly Walmart sales using store-level, economic, and seasonal predictors. Applying a log transformation to weekly sales substantially improves model performance, yielding an adjusted R² of 0.71, and stabilizes variance across stores with vastly different revenue scales. Overall model fit is strong, with well-behaved residuals and a highly significant F-statistic, indicating that the included predictors jointly explain a meaningful share of sales variation.
Results show that store size is the dominant driver of weekly sales, dwarfing macroeconomic effects and confirming that physical scale largely determines revenue potential. Holiday weeks are associated with an average 6–7% increase in sales, validating the importance of seasonal demand spikes. Inflation, proxied by CPI, has a small but statistically significant negative relationship with sales, even after controlling for store characteristics. Temperature and unemployment show small effects in the direction economic intuition suggests, but both are only marginally significant (p ≈ 0.08), while fuel prices do not appear to meaningfully impact sales once other factors are accounted for.
This specification represents the best balance between interpretability, explanatory power, and robustness among the models tested. The log transformation enables clear percentage-based interpretations while materially improving fit relative to linear alternatives, making the model suitable for both analytical insight and downstream forecasting.
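The percentage-based reading of a log-linear model comes from exponentiating a coefficient: a one-unit change in a predictor multiplies expected sales by exp(b), i.e., a change of (exp(b) - 1) * 100 percent. A short sketch using the fit_log estimates (variable names are illustrative, and the 10,000-unit step for size is an arbitrary choice for readability):

```r
# Coefficients copied from summary(fit_log$fit)
b_holiday <- 0.06377635244
b_size    <- 0.00000809484

# Percent change in weekly sales for a holiday week vs. a non-holiday week
pct_holiday <- (exp(b_holiday) - 1) * 100

# Percent change in weekly sales per additional 10,000 units of store size
pct_size_10k <- (exp(b_size * 10000) - 1) * 100

round(c(holiday = pct_holiday, size_per_10k = pct_size_10k), 2)
```

Note that predictions from this model are on the log scale, so they would need to be back-transformed with exp() before computing RMSE or MAE in dollars for comparison with the earlier models.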

Limitations

- The model does not explicitly account for store-level fixed effects or regional hierarchies, which may mask persistent location-specific dynamics.
- Temporal structure is handled implicitly; autocorrelation and seasonality are not directly modeled.
- The analysis assumes linear relationships on the log scale and may understate nonlinear or interaction effects beyond those tested.
- CPI and unemployment are measured at broader geographic levels and may not fully capture local economic conditions.

Next Steps

- Implement mixed-effects (hierarchical) models to capture store-specific variation.
- Explore time-series approaches for improved short-term forecasting.
- Incorporate promotional data, foot traffic, or local demographic variables to enhance predictive accuracy.
---
title: "Walmart's Weekly Sales Linear Regression"
output:
  html_document:
    df_print: paged
  pdf_document:
    latex_engine: xelatex
  html_notebook: default
always_allow_html: true
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)

```

***

```{r}
library("tidyverse")
library("tidymodels")
library("tidylog")

```


```{r}

# Reading in the Walmart dataset:

dfw <- read_csv("data/walmart.csv") %>% 
  rename_with(tolower) %>% 
  arrange(store)


head(dfw)

```
```{r}
# The structure of the data and each column's data type. Note that isholiday is a logical (i.e., true/false) predictor variable

str(dfw)

```


<h1 style="text-align:center;">Simple Linear Regression Model</h1>

#### We will begin by running a simple linear model that regresses weekly sales onto Consumer Price Index (CPI)


```{r}
# Specifying our model type and setting the computational engine

linear_model <- 
  linear_reg() %>% 
  set_engine("lm")

# Fitting the model

fit_cpi <- 
  linear_model %>% 
  fit(weekly_sales ~ cpi, data = dfw)
  
# Model output 

summary(fit_cpi$fit)

```


##### In this model, the intercept implies that a Walmart store facing a theoretical CPI of 0 could expect weekly sales of **~$828,280**.  We also observe that the relationship between Weekly_Sales and CPI is negative: if CPI increases by one unit, weekly sales are expected to decrease by ~$733, and if CPI decreases by one unit, they are expected to increase by ~$733.


##### In evaluating the model statistics, we see an Adjusted R-Squared value of 0.005269. In other words, this model explains only about 0.5% of the variance in Walmart's weekly sales.  So, while our interpretation of the estimated *effect* of CPI on Weekly_Sales still holds, we must conclude that CPI alone fails to explain the variance in our target variable.
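##### As a minimal sketch, the two quantities discussed above (the CPI slope and the Adjusted R-Squared) can be pulled directly from the fitted object rather than read off the printed summary. This assumes fit_cpi from the chunk above and uses broom's tidy() and glance(), which are attached with tidymodels; the object names cpi_slope and adj_r2 are illustrative only.

```{r}
# CPI slope: expected change in weekly sales (USD) per one-unit increase in CPI
cpi_slope <- 
  tidy(fit_cpi$fit) %>% 
  filter(term == "cpi") %>% 
  pull(estimate)

# Adjusted R-squared: share of the variance in weekly_sales explained by the model
adj_r2 <- glance(fit_cpi$fit)$adj.r.squared

cpi_slope
adj_r2
```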


```{r}
# Now we will plot the effect of CPI on sales for a few different stores in the dataset, starting with store 10.

plot_store_10 <- 
  dfw %>% 
  filter(store == 10) %>% 
  ggplot(aes(x = cpi, y = weekly_sales)) +
    geom_point() +
    geom_smooth(method = "lm") +
    labs(title = 'Weekly Sales vs. CPI for Store 10', x = 'Consumer Price Index', y = 'Weekly Sales (USD)') +
  theme_minimal()

plotly::ggplotly(plot_store_10)


plot_store_11 <- 
  dfw %>% 
  filter(store == 11) %>% 
  ggplot(aes(x = cpi, y = weekly_sales)) +
    geom_point() +
    geom_smooth(method = "lm") +
    labs(title = 'Weekly Sales vs. CPI for Store 11', x = 'Consumer Price Index', y = 'Weekly Sales (USD)') +
  theme_minimal()

plotly::ggplotly(plot_store_11)

plot_store_12 <- 
  dfw %>% 
  filter(store == 12) %>% 
  ggplot(aes(x = cpi, y = weekly_sales)) +
    geom_point() +
    geom_smooth(method = "lm") +
    labs(title = 'Weekly Sales vs. CPI for Store 12', x = 'Consumer Price Index', y = 'Weekly Sales (USD)') +
  theme_minimal()

plotly::ggplotly(plot_store_12)

plot_store_13 <- 
  dfw %>% 
  filter(store == 13) %>% 
  ggplot(aes(x = cpi, y = weekly_sales)) +
    geom_point() +
    geom_smooth(method = "lm") +
    labs(title = 'Weekly Sales vs. CPI for Store 13', x = 'Consumer Price Index', y = 'Weekly Sales (USD)') +
  theme_minimal()

plotly::ggplotly(plot_store_13)

```

```{r}

# A plot to demonstrate the fluctuation of CPI by region/store.  Note that the
# smoothed line is negative in some locales and positive in others.

animated_plot <- 
    dfw %>% 
    filter(store %in% c(11:15)) %>% 
    ggplot(aes(x = cpi, y = weekly_sales)) + 
    geom_point() + 
    geom_smooth(method = lm) +
    labs(title = 'Weekly Sales vs. CPI for Store {closest_state}', x = 'Consumer Price Index', y = 'Weekly Sales (USD)') +
    theme_minimal() + 
    gganimate::transition_states(store, transition_length = 1, state_length = 2) +
    gganimate::view_follow()

animated_plot
```


##### What we observe here is that the impact of CPI can vary greatly by store/region.  This still aligns with our evaluation of fit_cpi: recall that this model explained only a small share (~0.5%) of the variance in Weekly_Sales, so we would expect to see these kinds of swings.  With a (much) higher Adjusted R-Squared, these variations would look unusual.
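##### One way to quantify these swings, sketched here under the assumption that a separate simple regression per store is an acceptable summary, is to fit weekly_sales ~ cpi within each store and count how many slopes are positive versus negative. The names store_slopes and direction are illustrative only.

```{r}
# Fitting weekly_sales ~ cpi separately for each store and keeping the CPI slope
store_slopes <- 
  dfw %>% 
  group_by(store) %>% 
  group_modify(~ tidy(lm(weekly_sales ~ cpi, data = .x))) %>% 
  ungroup() %>% 
  filter(term == "cpi") %>% 
  mutate(direction = if_else(estimate > 0, "positive", "negative"))

# Number of stores with a positive vs. negative estimated CPI slope
store_slopes %>% 
  count(direction)
```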


```{r}

# Fitting a separate multiple regression for each store and keeping only the CPI coefficient

dfw %>%
  group_by(store) %>%
  group_modify(~ tidy(lm(weekly_sales ~ ., data = .x))) %>%
  filter(term == "cpi")

```


```{r}
# Filtering for 2012 and plotting CPI against Weekly_Sales

plot <- dfw %>%
  filter(lubridate::year(date) == 2012) %>%
  ggplot(aes(x = cpi, y = weekly_sales)) +
    geom_point() +
    geom_smooth(method = "lm") +
    labs(title = 'Weekly Sales vs. CPI for All Stores in 2012', x = 'Consumer Price Index', y = 'Weekly Sales (USD)') +
  theme_minimal()
    
plotly::ggplotly(plot)

```


##### We see an interesting effect when we filter for one specific year. The clusters are nearly vertical because CPI is calculated geographically, by either Core Based Statistical Area (CBSA) or Metropolitan Statistical Area (MSA).  Stores in the same region share the same CPI but have different sales volumes, hence the vertical clusters.


```{r}
# Now let's look exclusively at store 10

plot_store_cpi <- dfw %>%
  filter(store == 10, lubridate::year(date) == 2012) %>%
  ggplot(aes(x = cpi, y = weekly_sales)) +
    geom_point() +
    geom_smooth(method = "lm") +
    labs(title = 'Weekly Sales vs. CPI for Store 10 in 2012', x = 'Consumer Price Index', y = 'Weekly Sales (USD)') +
  theme_minimal()

plotly::ggplotly(plot_store_cpi)
```


##### Although CPI varies by region, the deviation in CPI across time for a single region tends to be much lower, which is why we see such a slim range here. Since CPI is a measure of inflation, we expect to see these regional effects.
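##### The contrast described above can be checked with a short sketch that compares the range of CPI within each store to the range across the whole dataset. It assumes dfw as loaded earlier; cpi_by_store and cpi_overall_range are illustrative names.

```{r}
# Range of CPI observed within each store over the full sample
cpi_by_store <- 
  dfw %>% 
  group_by(store) %>% 
  summarise(cpi_range = max(cpi) - min(cpi))

# Range of CPI across all stores and weeks, for comparison
cpi_overall_range <- max(dfw$cpi) - min(dfw$cpi)

summary(cpi_by_store$cpi_range)
cpi_overall_range
```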


```{r}

# A new iteration of the previous model that also includes store Size as an independent variable

fit_cpi_size <- 
  linear_model %>% 
  fit(weekly_sales ~ cpi + size, data = dfw)

summary(fit_cpi_size$fit)

```

```{r}
# Comparing fit_cpi to fit_cpi_size to see which is better at explaining the variance in Weekly_Sales

anova(fit_cpi$fit, fit_cpi_size$fit)
```


##### The model that includes size as a predictor variable (fit_cpi_size) appears to perform significantly better than fit_cpi.  The model now explains ~62% of the variance in weekly sales (Adjusted R-Squared), and the ANOVA test confirms that the improvement from including size is statistically significant.
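##### For readers who want to see what the ANOVA comparison is doing, the following sketch recomputes the nested-model F statistic by hand from the residual sums of squares. It assumes fit_cpi and fit_cpi_size from the chunks above; the intermediate names (rss_restricted, rss_full, q, f_stat) are illustrative, and the result should agree with the anova() output.

```{r}
# Residual sums of squares for the restricted (CPI only) and full (CPI + size) models
rss_restricted <- sum(residuals(fit_cpi$fit)^2)
rss_full <- sum(residuals(fit_cpi_size$fit)^2)

# Residual degrees of freedom of the full model and the number of added parameters
df_full <- df.residual(fit_cpi_size$fit)
q <- df.residual(fit_cpi$fit) - df_full

# F statistic: improvement per added parameter relative to the full model's residual variance
f_stat <- ((rss_restricted - rss_full) / q) / (rss_full / df_full)
p_value <- pf(f_stat, df1 = q, df2 = df_full, lower.tail = FALSE)

c(F = f_stat, p = p_value)
```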


```{r}
tidy(fit_cpi$fit)
tidy(fit_cpi_size$fit)
```


##### Note also that the magnitude of the CPI coefficient in the revised model has shrunk from ~$733 to ~$657.  This is because size now accounts for some of the variation in weekly sales that the CPI-only model had attributed to CPI.


```{r}
# Building a model that uses all variables EXCEPT Date and Store

fit_full <- 
  linear_model %>% 
  fit(weekly_sales ~ . - store - date, data = dfw)

summary(fit_full$fit)

```
```{r}
anova(fit_cpi_size$fit, fit_full$fit)
```


##### We observe a further, though slight, improvement in the Adjusted R-Squared value in the new model, which uses every predictor except the date and store identifiers (fit_full).  The ANOVA test also confirms that the improvement in explanatory power is statistically significant.

<h1 style="text-align:center;">More Linear Regression</h1>

##### We hypothesize that the effect of good weather is increased on holidays.  We can test this by revising fit_full and including an interaction term.


```{r}
fit_full_int <- 
  linear_model %>% 
  fit(weekly_sales ~ . - store - date + isholiday * temperature, data = dfw)

summary(fit_full_int$fit)
```
```{r}
anova(fit_full$fit, fit_full_int$fit)
```


##### Although the fit_full_int output suggests that the effect of good weather is indeed stronger on holidays, the ANOVA test shows no statistically significant improvement in fit.  We therefore cannot assert that the model with the interaction term is an improvement.

##### We'll also test whether the effect of temperature on weekly sales is nonlinear by adding a squared temperature term.


```{r}
fit_full_sq <- 
  linear_model %>% 
  fit(weekly_sales ~ . - store - date + I(temperature ^2), data = dfw)

summary(fit_full_sq$fit)
```
```{r}
anova(fit_full$fit, fit_full_sq$fit)
```

```{r}
## Plotting the relationship between Temperature^2 and Weekly Sales 

dfw %>%
  ggplot(aes(x = temperature, y = weekly_sales)) + 
  geom_smooth(method = "lm", formula = y ~ x + I(x^2))

```


##### The model output demonstrates a curvilinear, inverted U-shaped relationship (visualized above).  People are less likely to shop retail on a freezing cold day. Increasing temperatures are associated with increased sales, but only up to a point.  As temperatures become excessive and dangerous, sales start to decrease.
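##### The temperature at which the fitted curve peaks can be estimated from the two temperature coefficients, since a quadratic b1*x + b2*x^2 has its stationary point at -b1 / (2*b2). The sketch below assumes fit_full_sq from the chunk above; the names b_linear, b_squared, and turning_point are illustrative. The stationary point is a maximum (an inverted U) only when the squared-term coefficient is negative.

```{r}
# Extracting the linear and squared temperature coefficients from the fitted model
coefs <- coef(fit_full_sq$fit)
b_linear <- coefs[["temperature"]]
b_squared <- coefs[[grep("^I\\(temperature", names(coefs))]]

# Stationary point of the quadratic, in the same units as the temperature column
turning_point <- -b_linear / (2 * b_squared)

# TRUE if the fitted curve is an inverted U, so the turning point is a peak
is_inverted_u <- b_squared < 0

turning_point
is_inverted_u
```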

##### If we were managing Walmart's promotions, we could offer larger discounts when the weather is at either extreme and perhaps even increase the price of certain products when the temperature is mild.

<h1 style="text-align:center;">Predictive Analytics</h1>

##### Now that we have a model that is fairly robust we will use it to make predictions of weekly sales revenue.


```{r}
# Setting seed for reproducibility (set.seed() coerces its argument to an integer, so 3.14159 is treated as 3)

set.seed(3.14159)

# Splitting the data set into a training dataset (75%) and a test dataset (25%)

dfw_split <- initial_split(dfw)

dfw_train <-  training(dfw_split)

dfw_test <- testing(dfw_split)
```

```{r}
# Fitting the model

fit_org <- 
  linear_model %>% 
  fit(weekly_sales ~ . - date - store + I(temperature^2), data = dfw_train)

summary(fit_org$fit)
  
```

```{r}
# The linear regression output as a tibble

tidy(fit_org)
```
```{r}

# Creating a new dataframe with predicted values

results_org <-
  predict(fit_org, new_data = dfw_test) %>% 
  bind_cols(dfw_test) %>% 
  rename(Predicted_Sales = .pred)

results_org %>% 
  arrange(date)

```

```{r}
# Defining the metric set we will be working with to evaluate the models

perf_metrics <- metric_set(rmse, mae)
```


```{r}

# Calculating the performance of fit_org on the test set

perf_metrics(results_org, truth =  weekly_sales, estimate = Predicted_Sales)
```


##### These metrics indicate that our model's predictions are off by ~$240,424 according to RMSE and ~$179,092 according to MAE.  These numbers appear alarming until one recalls that weekly sales by Walmart store location range from about $70k to $2.8 million, with a mean of $740k and a median of $689k.
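##### To put the errors in context, a rough sketch is to express RMSE and MAE as a percentage of mean weekly sales in the test set. This assumes results_org from the chunk above and uses yardstick's rmse_vec() and mae_vec(); the name error_context and the choice of the mean as the reference point are illustrative assumptions.

```{r}
# Expressing test-set error relative to the average weekly sales figure
error_context <- 
  results_org %>% 
  summarise(
    rmse = rmse_vec(weekly_sales, Predicted_Sales),
    mae = mae_vec(weekly_sales, Predicted_Sales),
    mean_sales = mean(weekly_sales),
    rmse_pct_of_mean = 100 * rmse / mean_sales,
    mae_pct_of_mean = 100 * mae / mean_sales
  )

error_context
```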


```{r}
#Building the model without I(Temperature^2) variable using only the training data set

fit_org_nosq <-
  linear_model %>% 
  fit(weekly_sales ~ . - store - date, data = dfw_train)

```

```{r}
summary(fit_org_nosq$fit)
```


```{r}
#Comparing the models (smaller, nested model listed first)

anova(fit_org_nosq$fit, fit_org$fit)
```


```{r}
#Creating a new dataframe results_org_nosq with predictions

results_org_nosq <-
  predict(fit_org_nosq, new_data = dfw_test) %>% 
  bind_cols(dfw_test) %>% 
  rename(Predicted_Sales = .pred)

#Calculating performance metrics

perf_metrics(results_org_nosq, truth =  weekly_sales, estimate = Predicted_Sales)
```


##### When we remove the squared temperature term, something notable occurs.  First, our Adjusted R-Squared value diminishes slightly (from 62.2% to 62.1%), making the model slightly less appealing in terms of explaining the variance in weekly sales.  However, we also observe that the prediction error has been reduced, making fit_org_nosq relatively superior in terms of predictive capability. Since we are trying to build a reliable *predictive* model, we exclude the term and conclude that fit_org_nosq is better for that purpose.
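##### The comparison above is easier to read with both sets of test metrics side by side. The sketch below assumes results_org, results_org_nosq, and perf_metrics from the earlier chunks; the labels with_temp_sq and without_temp_sq and the name metric_comparison are illustrative.

```{r}
# Stacking the test-set metrics for both models and spreading them into one row per model
metric_comparison <- 
  bind_rows(
    with_temp_sq = perf_metrics(results_org, truth = weekly_sales, estimate = Predicted_Sales),
    without_temp_sq = perf_metrics(results_org_nosq, truth = weekly_sales, estimate = Predicted_Sales),
    .id = "model"
  ) %>% 
  select(model, .metric, .estimate) %>% 
  pivot_wider(names_from = .metric, values_from = .estimate)

metric_comparison
```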

<h1 style="text-align:center;">More Predictive Modeling</h1>

##### We are fairly pleased with both the explanatory and predictive power of fit_org_nosq, but of course we would like to improve upon both.  One issue that we have not yet discussed is the variability in weekly sales across Walmart locations, as shown below.


```{r}
# Calculating total weekly sales per store

sales_by_store <- aggregate(weekly_sales ~ store, data = dfw, sum)

# A bar chart showing the distribution of weekly sales revenue by store

ggplot(sales_by_store, aes(x = store, y = weekly_sales)) +
  geom_bar(stat = "identity", fill = "#0078D4") +
  labs(title = "Total Weekly Sales by Store", x = "Store Number", y = "Total Weekly Sales")

```


##### Viewing the bar chart, we can see that sales vary widely across stores, so rescaling the weekly_sales variable could improve our model.  One way to do this is to replace each weekly sales value with its natural logarithm.  We'll use the log() function to accomplish this.


```{r}
# Log model with all other variables unchanged
dfw_log <-
  dfw %>% 
  mutate(log_sales = log(weekly_sales))

dfw_log
```
```{r}
set.seed(3.14159)

dfwlog_split <- initial_split(dfw_log)
dfwlog_train <- training(dfwlog_split)
dfwlog_test <- testing(dfwlog_split)
```

```{r}
fit_log <- 
  linear_model %>% 
  fit(log_sales ~ . - store - date - weekly_sales, data=dfwlog_train)

summary(fit_log$fit)
```
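##### Because the response is log_sales, each coefficient can be read as an approximate proportional change in weekly sales per one-unit increase in the predictor. A minimal sketch of that conversion, assuming fit_log from the chunk above, is 100 * (exp(estimate) - 1); the name log_effects is illustrative.

```{r}
# Converting log-scale coefficients into percentage changes in weekly sales
log_effects <- 
  tidy(fit_log$fit) %>% 
  filter(term != "(Intercept)") %>% 
  mutate(pct_change = 100 * (exp(estimate) - 1)) %>% 
  select(term, estimate, pct_change, p.value)

log_effects
```

##### For a logical predictor such as isholiday, pct_change is the estimated percentage difference between holiday and non-holiday weeks, holding the other predictors fixed.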
##### This final model uses a log-linear regression to explain weekly Walmart sales using store-level, economic, and seasonal predictors. Applying a log transformation to weekly sales substantially improves model performance, yielding an adjusted R² of 0.71, and stabilizes variance across stores with vastly different revenue scales. Overall model fit is strong, with well-behaved residuals and a highly significant F-statistic, indicating that the included predictors jointly explain a meaningful share of sales variation.

##### Results show that store size is the dominant driver of weekly sales, dwarfing macroeconomic effects and confirming that physical scale largely determines revenue potential. Holiday weeks are associated with an average 6–7% increase in sales, validating the importance of seasonal demand spikes. Inflation, proxied by CPI, has a small but statistically significant negative relationship with sales, even after controlling for store characteristics. Temperature and unemployment exhibit modest effects consistent with economic intuition, while fuel prices do not appear to meaningfully impact sales once other factors are accounted for.

##### This specification represents the best balance between interpretability, explanatory power, and robustness among the models tested. The log transformation enables clear percentage-based interpretations while materially improving fit relative to linear alternatives, making the model suitable for both analytical insight and downstream forecasting.
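##### To judge the log model as a forecasting tool on the same footing as the earlier models, its predictions can be back-transformed to dollars and scored with the same metric set. The sketch below assumes fit_log, dfwlog_test, and perf_metrics from the earlier chunks; results_log is an illustrative name. Note that exp() of a log-scale prediction tends to understate the mean on the dollar scale, so this is a simple approximation rather than a bias-corrected forecast.

```{r}
# Predicting on the log scale, then converting predictions back to dollars
results_log <- 
  predict(fit_log, new_data = dfwlog_test) %>% 
  bind_cols(dfwlog_test) %>% 
  mutate(Predicted_Sales = exp(.pred))

# RMSE and MAE in dollars, computed the same way as for the earlier models
perf_metrics(results_log, truth = weekly_sales, estimate = Predicted_Sales)
```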

<h1 style="text-align:center;">Limitations</h1>

##### - The model does not explicitly account for store-level fixed effects or regional hierarchies, which may mask persistent location-specific dynamics.

##### - Temporal structure is handled implicitly; autocorrelation and seasonality are not directly modeled.

##### - The analysis assumes linear relationships on the log scale and may understate nonlinear or interaction effects beyond those tested.

##### - CPI and unemployment are measured at broader geographic levels and may not fully capture local economic conditions.

<h1 style="text-align:center;">Next Steps</h1>

##### - Implement mixed-effects (hierarchical) models to capture store-specific variation.

##### - Explore time-series approaches for improved short-term forecasting.

##### - Incorporate promotional data, foot traffic, or local demographic variables to enhance predictive accuracy.
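
##### As a starting point for the first of these next steps, a random-intercept model could be sketched as follows. This is an assumption-laden illustration rather than a tested specification: it assumes the lme4 package is installed (it is not loaded elsewhere in this notebook), reuses dfwlog_train from above, and rescales size with scale() because its magnitude is far larger than that of the other predictors. The name fit_mixed is illustrative, and lme4 may still emit scaling or convergence warnings.

```{r}
# Random intercept per store: each store gets its own baseline level of log sales
fit_mixed <- 
  lme4::lmer(
    log_sales ~ isholiday + temperature + fuel_price + cpi + unemployment + scale(size) + (1 | store),
    data = dfwlog_train
  )

summary(fit_mixed)
```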
